Exploratory Data Analysis
Exploration
Summary Statistics
The following is a summary of the data.
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| TARGET_WINS | 0.00 | 71.00 | 82.00 | 80.79 | 92.00 | 146.00 | 0 |
| TEAM_BATTING_H | 891 | 1383 | 1454 | 1469 | 1537 | 2554 | 0 |
| TEAM_BATTING_2B | 69.0 | 208.0 | 238.0 | 241.2 | 273.0 | 458.0 | 0 |
| TEAM_BATTING_3B | 0.00 | 34.00 | 47.00 | 55.25 | 72.00 | 223.00 | 0 |
| TEAM_BATTING_HR | 0.00 | 42.00 | 102.00 | 99.61 | 147.00 | 264.00 | 0 |
| TEAM_BATTING_BB | 0.0 | 451.0 | 512.0 | 501.6 | 580.0 | 878.0 | 0 |
| TEAM_BATTING_SO | 0.0 | 548.0 | 750.0 | 735.6 | 930.0 | 1399.0 | 102 |
| TEAM_BASERUN_SB | 0.0 | 66.0 | 101.0 | 124.8 | 156.0 | 697.0 | 131 |
| TEAM_BASERUN_CS | 0.0 | 38.0 | 49.0 | 52.8 | 62.0 | 201.0 | 772 |
| TEAM_BATTING_HBP | 29.00 | 50.50 | 58.00 | 59.36 | 67.00 | 95.00 | 2085 |
| TEAM_PITCHING_H | 1137 | 1419 | 1518 | 1779 | 1682 | 30132 | 0 |
| TEAM_PITCHING_HR | 0.0 | 50.0 | 107.0 | 105.7 | 150.0 | 343.0 | 0 |
| TEAM_PITCHING_BB | 0.0 | 476.0 | 536.5 | 553.0 | 611.0 | 3645.0 | 0 |
| TEAM_PITCHING_SO | 0.0 | 615.0 | 813.5 | 817.7 | 968.0 | 19278.0 | 102 |
| TEAM_FIELDING_E | 65.0 | 127.0 | 159.0 | 246.5 | 249.2 | 1898.0 | 0 |
| TEAM_FIELDING_DP | 52.0 | 131.0 | 149.0 | 146.4 | 164.0 | 228.0 | 286 |
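A summary of this kind, together with an explicit count of missing values, can be sketched in base R as follows. The data frame `toy` and its values are made up for illustration; the real data frame is presumably the one called `training` in the model output further down.

```r
# Toy stand-in for the training data (values are invented for illustration).
toy <- data.frame(
  TARGET_WINS     = c(70, 82, 91, 65, 88),
  TEAM_BATTING_SO = c(548, NA, 930, 750, NA),
  TEAM_BASERUN_SB = c(66, 101, NA, 156, 120)
)

# Min, quartiles, mean and max per column, plus an NA count where needed.
print(summary(toy))

# Missing values per column, as a count and as a share of all rows.
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)
print(data.frame(na_count, na_share))
```

Looking at the share of missing values rather than the raw count makes it easier to judge which variables can reasonably be imputed and which are mostly empty.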
Plots
The following density plots show the spread of the data. The red vertical line is the mean and the blue vertical line is the median. The scatter plot shows the relationship between wins and each variable.
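One possible way to draw such a pair of plots with base R graphics is sketched below. The helper `plot_variable` and the data frame `toy` are hypothetical and are not taken from the original analysis; the report may well have used a different plotting library.

```r
# Hypothetical helper: density plot with mean (red) and median (blue) lines,
# next to a scatter plot of the target against the same variable.
plot_variable <- function(data, variable, target = "TARGET_WINS") {
  x <- data[[variable]]
  keep <- !is.na(x)                      # density() cannot handle NAs
  m  <- mean(x[keep])
  md <- median(x[keep])

  old <- par(mfrow = c(1, 2))            # two panels side by side
  on.exit(par(old))

  plot(density(x[keep]), main = variable, xlab = variable)
  abline(v = m,  col = "red")            # mean
  abline(v = md, col = "blue")           # median

  plot(x[keep], data[[target]][keep], xlab = variable, ylab = target,
       main = paste(target, "vs", variable))

  invisible(c(mean = m, median = md))
}

# Toy data with a long right tail (values are invented for illustration).
toy <- data.frame(
  TARGET_WINS     = c(92, 82, 80, 71, 40),
  TEAM_FIELDING_E = c(65, 127, 159, 249, 1898)
)
stats <- plot_variable(toy, "TEAM_FIELDING_E")
```

When the red line sits well to the right of the blue one, the variable is right-skewed. The summary table shows this pattern for TEAM_FIELDING_E, whose mean (246.5) is far above its median (159.0).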
Missing Data
Batting Strike Outs
To fill in the missing values, we alternate between the two modes of the distribution (578 and 909).
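A minimal sketch of this alternating fill is shown below. The helper name `fill_alternating` and the example vector are assumptions made for illustration; only the two fill values, 578 and 909, come from the text.

```r
# Hypothetical helper: replace each NA by cycling through `fill_values`.
fill_alternating <- function(x, fill_values) {
  missing <- which(is.na(x))
  # rep(..., length.out = n) repeats 578, 909, 578, ... to exactly n values.
  x[missing] <- rep(fill_values, length.out = length(missing))
  x
}

# Invented example: three of six strikeout totals are missing.
so <- c(548, NA, 930, NA, NA, 750)
filled <- fill_alternating(so, c(578, 909))
print(filled)
```

Applied to the 102 missing strikeout values reported in the summary, this assigns each mode to 51 rows, so the filled values preserve the two-peaked shape instead of piling up at a single mean or median.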
Scaled and Combined
The idea behind this model is that teams that are better than average will win more games and teams that are worse than average will win fewer. We determine whether a team is better than average by looking at how well it performs at batting, pitching, and fielding.
Since there is more than one way to win a baseball game (for example, a lineup of power hitters who hit home runs versus one of reliable singles hitters), we need to combine the various batting measures. Because striking out at bat is bad for the batting team, we change the sign of this variable. That way it can be combined with the others and still fits the model in which better teams win more games and worse teams win fewer.
We scale all variables, which centers them at 0 and gives them a standard deviation of 1. We can then combine almost all the batting variables into one measure, TEAM_BATTING (hit by pitch, TEAM_BATTING_HBP, is excluded).
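The scaling and combining step might look roughly like the sketch below. The text does not say how the scaled variables are combined, so the row-wise sum used here is an assumption, as are the toy values and the choice of only four batting columns.

```r
# Toy batting data (values are invented for illustration).
toy <- data.frame(
  TEAM_BATTING_H  = c(1383, 1454, 1537, 1469, 1600),
  TEAM_BATTING_HR = c(42, 102, 147, 99, 160),
  TEAM_BATTING_BB = c(451, 512, 580, 501, 600),
  TEAM_BATTING_SO = c(930, 750, 548, 735, 600)
)

# scale() centers each column at 0 and divides it by its standard deviation.
scaled <- as.data.frame(scale(toy))

# Strikeouts hurt the batting team, so flip the sign before combining.
scaled$TEAM_BATTING_SO <- -scaled$TEAM_BATTING_SO

# Assumption: the combined measure is the row-wise sum of the scaled columns.
toy$TEAM_BATTING <- rowSums(scaled)
print(round(toy$TEAM_BATTING, 2))
```

Note that a sum of several standardized columns is still centered at 0, but its own standard deviation is generally not 1, which matters when interpreting a regression slope on the combined score.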
    Call:
    lm(formula = TARGET_WINS ~ TEAM_BATTING, data = training)

    Residuals:
        Min      1Q  Median      3Q     Max
    -69.251  -8.896   0.677   9.421  48.867

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)   80.7909     0.2945  274.38   <2e-16 ***
    TEAM_BATTING   2.5623     0.1058   24.22   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 14.05 on 2274 degrees of freedom
    Multiple R-squared:  0.2051,	Adjusted R-squared:  0.2047
    F-statistic: 586.6 on 1 and 2274 DF,  p-value: < 2.2e-16
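This model says that a team with average batting is expected to win about 81 games (the intercept, 80.79). Each one-unit increase in the combined TEAM_BATTING score is associated with about 2.6 more wins (the slope, 2.56). Batting alone explains roughly 21% of the variance in wins (R-squared of 0.2051).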
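To make the fitted line concrete, the sketch below plugs the two coefficients from the model output into the regression equation. The function name `predict_wins` is hypothetical; with the fitted model object available, `predict()` would do the same job.

```r
# Coefficients copied from the model summary above.
intercept <- 80.7909
slope     <- 2.5623

# Hypothetical helper: expected wins for a given combined batting score.
predict_wins <- function(team_batting) {
  intercept + slope * team_batting
}

# Scores of -1, 0 and 1 on the combined TEAM_BATTING measure.
wins <- predict_wins(c(-1, 0, 1))
print(round(wins, 1))
```

A score of 0 reproduces the intercept, and each additional unit of TEAM_BATTING moves the prediction by the slope, in either direction.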